FML HACKATHON :

OVERVIEW OF THE DATASET :

Importing the Libraries :

In [ ]:
# importing the libraries 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report,f1_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier

import warnings
warnings.filterwarnings("ignore")

from google.colab import files

random_state = 100
In [ ]:
# to display entire rows and columns of dataframe 
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)   # -1 is deprecated; None displays full column contents

Reading the Data :

In [ ]:
'''
Insert Correct Path as per location of train_input.csv and test_input.csv 
'''
path = "/content/"

# read the train data
train_df = pd.read_csv(path + "train_input.csv")
# read the test data 
test_df = pd.read_csv(path + "test_input.csv")
In [ ]:
# storing the original data 
train_org=train_df.copy()
test_org=test_df.copy()
In [ ]:
# sample of train data
train_df.head(10)
Out[ ]:
Feature 1 (Discrete) Feature 2 (Discrete) Feature 3 (Discrete) Feature 4 (Discrete) Feature 5 (Discrete) Feature 6 (Discrete) Feature 7 (Discrete) Feature 8 (Discrete) Feature 9 Feature 10 Feature 11 Feature 12 Feature 13 Feature 14 Feature 15 Feature 16 Feature 17 Feature 18 Feature 19 (Discrete) Feature 20 (Discrete) Feature 21 (Discrete) Feature 22 (Discrete) Feature 23 (Discrete) Feature 24 Target Variable (Discrete)
0 1404 12 64 14 3 1 1 1 110.502 35775.2 35797.1 0.000261 0.172 1436.052 5000.5 NaN NaN 15.04 104 12 2 32 1409 37677.1 1
1 909 0 235 32 1 1 1 1 -40.448 35779.4 35794.3 0.000178 0.032 1436.111 3720.5 2200.3 4900.005 12.03 20 1 0 13 909 25239.1 1
2 654 3 175 2 1 1 1 1 -27.445 35770.4 35803.3 0.000391 0.021 1436.103 4685.4 1973.3 10000.004 13.01 1 1 0 13 654 27683.5 1
3 1372 12 382 14 2 0 1 0 0.001 509.2 513.5 0.000291 97.541 94.844 NaN NaN NaN NaN 313 12 10 54 1377 39363.2 0
4 786 3 199 2 1 0 1 0 0.001 612.1 697.3 0.006050 97.981 97.823 4.1 NaN NaN NaN 171 1 5 11 786 40044.4 2
5 811 20 209 54 3 0 3 3 0.003 853.4 868.3 0.001040 20.002 102.203 1000.4 NaN NaN 5.03 39 18 6 16 811 37838.1 0
6 805 0 206 52 3 0 1 0 0.004 984.2 1014.1 0.002040 99.202 105.102 47.1 NaN 56.003 43864.03 180 47 3 74 805 27004.2 2
7 1129 0 120 14 3 0 1 0 0.004 661.2 673.1 0.000853 98.062 98.081 NaN NaN NaN NaN 104 12 10 41 1134 39210.5 2
8 1091 3 277 2 2 0 2 3 0.002 1016.3 1203.1 0.012500 63.404 107.405 5000.5 NaN NaN NaN 3 1 1 35 1092 28541.3 0
9 1118 0 283 2 1 1 1 1 108.024 35783.4 35790.3 0.000083 0.031 1436.103 4007.4 NaN NaN 15.03 2 1 3 36 1123 34941.5 1
In [ ]:
# shape of datasets 
print("\nTrain dataset shape : ",train_df.shape)
print("\nTest dataset shape : ",test_df.shape)
Train dataset shape :  (994, 25)

Test dataset shape :  (426, 24)
In [ ]:
# info about the dataset 
train_df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 994 entries, 0 to 993
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Feature 1 (Discrete)        994 non-null    int64  
 1   Feature 2 (Discrete)        994 non-null    int64  
 2   Feature 3 (Discrete)        994 non-null    int64  
 3   Feature 4 (Discrete)        994 non-null    int64  
 4   Feature 5 (Discrete)        994 non-null    int64  
 5   Feature 6 (Discrete)        994 non-null    int64  
 6   Feature 7 (Discrete)        994 non-null    int64  
 7   Feature 8 (Discrete)        994 non-null    int64  
 8   Feature 9                   980 non-null    float64
 9   Feature 10                  993 non-null    float64
 10  Feature 11                  993 non-null    float64
 11  Feature 12                  993 non-null    float64
 12  Feature 13                  993 non-null    float64
 13  Feature 14                  993 non-null    float64
 14  Feature 15                  922 non-null    float64
 15  Feature 16                  325 non-null    float64
 16  Feature 17                  448 non-null    float64
 17  Feature 18                  664 non-null    float64
 18  Feature 19 (Discrete)       994 non-null    int64  
 19  Feature 20 (Discrete)       994 non-null    int64  
 20  Feature 21 (Discrete)       994 non-null    int64  
 21  Feature 22 (Discrete)       994 non-null    int64  
 22  Feature 23 (Discrete)       994 non-null    int64  
 23  Feature 24                  993 non-null    float64
 24  Target Variable (Discrete)  994 non-null    int64  
dtypes: float64(11), int64(14)
memory usage: 194.3 KB

NOTE :

  1. There are 24 input features, of which 13 are of dtype int64 and the remaining 11 are float64. The target variable is also a discrete (integer) feature.

  2. Features 9-18 and Feature 24 have missing values, which need to be handled.

In [ ]:
# statistics about the dataset
train_df.describe(percentiles=[0.01,0.1,0.9,0.95,0.99])
Out[ ]:
Feature 1 (Discrete) Feature 2 (Discrete) Feature 3 (Discrete) Feature 4 (Discrete) Feature 5 (Discrete) Feature 6 (Discrete) Feature 7 (Discrete) Feature 8 (Discrete) Feature 9 Feature 10 Feature 11 Feature 12 Feature 13 Feature 14 Feature 15 Feature 16 Feature 17 Feature 18 Feature 19 (Discrete) Feature 20 (Discrete) Feature 21 (Discrete) Feature 22 (Discrete) Feature 23 (Discrete) Feature 24 Target Variable (Discrete)
count 994.000000 994.000000 994.000000 994.000000 994.000000 994.000000 994.000000 994.000000 980.000000 993.000000 993.000000 9.930000e+02 993.000000 993.000000 922.000000 325.000000 448.000000 664.000000 994.000000 994.000000 994.000000 994.000000 994.000000 993.000000 994.000000
mean 708.187123 5.899396 159.564386 11.650905 2.623742 0.581489 2.041247 1.642857 6.113291 14527.974220 16057.136858 2.019905e-02 50.015251 642.128365 2042.519523 1329.779692 3814.420516 2061.806852 94.004024 7.659960 4.776660 45.634809 709.334004 34605.373112 1.706237
std 405.826060 7.563357 106.705581 15.159370 2.652267 0.746863 2.550459 1.531875 54.659315 16446.925641 18976.592514 1.109327e-01 41.786676 659.421311 2285.714865 1265.242499 4645.866095 9279.644009 90.310887 10.658154 4.773882 36.069872 407.360827 6120.845597 2.417255
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -177.116000 200.400000 322.500000 1.000000e-08 0.001000 7.014000 0.200000 6.500000 4.055000 0.270000 0.000000 0.000000 0.000000 0.000000 0.000000 7530.300000 0.000000
1% 12.930000 0.000000 2.000000 1.000000 0.000000 0.000000 0.000000 0.000000 -125.720380 356.492000 403.420000 1.095120e-05 0.002920 92.503920 1.300000 12.300000 20.003410 1.016300 1.000000 1.000000 0.000000 0.000000 12.930000 23464.884000 0.000000
10% 149.300000 0.000000 29.000000 2.000000 1.000000 0.000000 1.000000 0.000000 -50.247800 495.640000 538.500000 9.490000e-05 0.031000 94.907600 5.100000 77.180000 160.002700 3.043000 2.000000 1.000000 0.000000 10.000000 149.300000 25481.140000 0.000000
50% 705.000000 3.000000 146.000000 5.000000 2.000000 0.000000 1.000000 1.000000 0.003000 1407.300000 1415.100000 5.340500e-04 54.905000 114.101000 1360.100000 980.400000 1500.003000 10.050000 73.500000 2.000000 3.000000 39.000000 705.000000 37265.200000 1.000000
90% 1273.700000 15.000000 321.700000 27.700000 6.000000 1.000000 4.000000 4.000000 81.203700 35780.500000 35805.100000 7.354034e-03 98.179000 1436.109800 5145.320000 2515.720000 12000.004000 15.050000 238.000000 18.000000 12.000000 103.000000 1278.700000 40959.220000 6.000000
95% 1341.700000 20.000000 367.000000 49.000000 11.000000 2.000000 6.000000 4.000000 112.996450 35784.440000 35861.940000 2.174004e-02 98.545200 1436.135000 5990.050000 3442.960000 13000.005000 18.048500 276.000000 32.000000 15.000000 118.700000 1346.700000 41337.000000 6.000000
99% 1401.140000 38.000000 382.000000 73.000000 11.000000 3.000000 14.070000 7.000000 164.422420 35794.472000 48932.836000 7.253600e-01 120.303080 1440.109520 6959.182000 3955.076000 18689.004530 44021.412600 313.000000 55.000000 18.000000 148.140000 1406.140000 41610.296000 14.000000
max 1412.000000 46.000000 386.000000 80.000000 14.000000 4.000000 19.000000 7.000000 328.502000 37778.400000 156833.300000 8.640000e-01 143.402000 4032.863000 18000.300000 10000.400000 20000.005000 44118.010000 318.000000 58.000000 23.000000 155.000000 1417.000000 41634.300000 17.000000
In [ ]:
# target variable 
print(train_df['Target Variable (Discrete)'].value_counts())
train_df['Target Variable (Discrete)'].value_counts().plot(kind='barh',ylabel="Output Categories")
1     488
0     249
2     109
6      70
5      41
8       7
7       5
14      5
15      4
13      3
4       3
3       3
9       2
11      1
10      1
16      1
12      1
17      1
Name: Target Variable (Discrete), dtype: int64
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7d9dd59390>

Note :

  1. Classes 1, 0, 2, 6 and 5 have a good number of occurrences.

  2. The remaining classes (3, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17) have very few occurrences.
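
The imbalance noted above can be quantified directly from the value counts. A minimal sketch on a toy series built from the counts printed above (only the six largest classes are used; the names `target`, `top5_share` and `imbalance_ratio` are illustrative, not part of the notebook):

```python
import pandas as pd

# toy stand-in built from the counts printed above (top six classes only)
target = pd.Series([1] * 488 + [0] * 249 + [2] * 109 + [6] * 70 + [5] * 41 + [8] * 7)

counts = target.value_counts()
top5_share = counts.head(5).sum() / len(target)   # share covered by the 5 largest classes
imbalance_ratio = counts.max() / counts.min()     # most frequent vs. least frequent class
print(f"top-5 share: {top5_share:.3f}, imbalance ratio: {imbalance_ratio:.1f}")
```

A handful of classes covering over 99% of the rows is what motivates the rare-class filtering applied later in preprocessing.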

DATA VISUALIZATION :

Correlation Plot -- HEATMAP:

In [ ]:
plt.figure(figsize=(20,10))
sns.heatmap(train_df.drop(columns='Target Variable (Discrete)',axis=1).corr(),annot=True)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7d9dd592e8>

Note :

  1. The correlation between Feature 1 and Feature 23 is 1, so keeping both adds nothing for further analysis and either one can be removed.

  2. Similarly, the correlations within the pairs (Feature 11, Feature 14) and (Feature 15, Feature 16) are very high, so one feature from each pair can be removed to avoid redundancy.
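
The correlated pairs read off the heatmap can also be extracted programmatically from the correlation matrix. A small sketch on a synthetic frame (the column names reuse the feature naming purely for illustration, and the 0.95 threshold is an assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "Feature 1": a,
    "Feature 23": a,                      # perfect duplicate: correlation is exactly 1
    "Feature 11": rng.normal(size=200),   # independent column
})

corr = df.corr().abs()
# keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c, upper.loc[r, c])
         for r in upper.index for c in upper.columns
         if pd.notna(upper.loc[r, c]) and upper.loc[r, c] > 0.95]
print(pairs)
```

The same upper-triangle scan is essentially what the `return_uncorrelated_dataset` helper in the preprocessing section performs.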

In [ ]:
# get the columns from train dataset 
cols_train=train_df.columns.tolist()
# removing the target variable 
cols_train.remove('Target Variable (Discrete)')
# getting the discrete columns 
discrete_cols=[x for x in cols_train if "Discrete" in x]
# getting the continuous columns 
continuous_cols=[x for x in cols_train if "Discrete" not in x]

print("\nNumber of columns : ",len(cols_train))
print("\nNumber of Discrete Columns : ",len(discrete_cols))
print("\nNumber of Continuous Columns : ",len(continuous_cols))
Number of columns :  24

Number of Discrete Columns :  13

Number of Continuous Columns :  11
In [ ]:
def discrete_function(cols):
  """
  Input : cols - features of given input data
  returns : 1 if it is categorical feature 
            0 if it is not a categorical feature 
  """
  l=len(train_df[cols].value_counts())
  if l<=20:
    return 1
  else:
    return 0
In [ ]:
discrete_cat=[]
discrete_cont=[]
for x in discrete_cols:
  d=discrete_function(x)
  if d==1:
    discrete_cat.append(x)
  else:
    discrete_cont.append(x)

Note :

  1. The discrete features 5, 6, 7, 8 and 21 behave like categorical features.

  2. The other discrete features (1, 2, 3, 4, 19, 20, 22, 23) do not behave like categorical features.

Univariate and Bivariate Analysis:

In [ ]:
def count_plot(discrete_cat):
  """
  input : discrete_cat : list containing the discrete column 
  plots : count plots of all the features given in the discrete_cat list 
  """
  fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(nrows=2,ncols=2,figsize=(25,12)) # figure with a 2x2 grid of subplots 
  # count plots 
  s2=sns.countplot(x=discrete_cat[0],data=train_df,ax=ax1,palette='magma')
  s3=sns.countplot(x=discrete_cat[1],data=train_df,ax=ax2,palette='crest')
  s4=sns.countplot(x=discrete_cat[2],data=train_df,ax=ax3,palette='cubehelix')
  s5=sns.countplot(x=discrete_cat[3],data=train_df,ax=ax4,palette='Spectral')
  # annotating the labels 
  for p in s2.patches:
    s2.annotate(format(p.get_height(), ''), (p.get_x() + p.get_width()/2 , p.get_height()), ha = 'center', va = 'center', xytext = (0,10),rotation=0, textcoords = 'offset points',fontsize=12)
  for p in s3.patches:
    s3.annotate(format(p.get_height(), ''), (p.get_x() + p.get_width()/2 , p.get_height()), ha = 'center', va = 'center', xytext = (0,10),rotation=0, textcoords = 'offset points',fontsize=12)
  for p in s4.patches:
    s4.annotate(format(p.get_height(), ''), (p.get_x() + p.get_width()/2 , p.get_height()), ha = 'center', va = 'center', xytext = (0,10),rotation=45, textcoords = 'offset points',fontsize=12)
  for p in s5.patches:
    s5.annotate(format(p.get_height(), ''), (p.get_x() + p.get_width()/2 , p.get_height()), ha = 'center', va = 'center', xytext = (0,10),rotation=45, textcoords = 'offset points',fontsize=12)
  
In [ ]:
count_plot(discrete_cat)

Note :

In most of the categorical features, some categories have very few entries (single-digit counts).

In [ ]:
def scatterplot(col1,col2,c):
  """
  input : col1 : along x axis 
          col2 : along y axis 
             c : color to plot the scatter plot
  plots : scatter plot of given two features col1 and col2
  """
  plt.figure(figsize=(10,5))
  sns.scatterplot(x=col1,y=col2,data=train_df,color=c)
color=['black','red','green','orange','blue','purple','grey']
c=0
for i in range(len(discrete_cont)-1):
  for j in range(i+1,len(discrete_cont)):
    scatterplot(discrete_cont[i],discrete_cont[j],color[c])
    c+=1
    if c>6:
      c=0
In [ ]:
for i in range(len(continuous_cols)-1):
  for j in range(i+1,len(continuous_cols)):
    scatterplot(continuous_cols[i],continuous_cols[j],color[c])
    c+=1
    if c>6:
      c=0

Note :

  1. There is not much correlation among the continuous features.

  2. Among all combinations of continuous features, only the pairs (Feature 15, Feature 16) and (Feature 15, Feature 17) are noticeably correlated.

In [ ]:
# non-categorical cols (concatenate rather than extend(), which would mutate continuous_cols in place)
non_cat = continuous_cols + discrete_cont
In [ ]:
def histogram(col):
  """
  input : col - feature
  plots : distplot,rug plot and kde plot for given input feature(col)
  """
  fig, ((ax1,ax2,ax3)) = plt.subplots(nrows=1,ncols=3,figsize=(20,5))  # a separate plt.figure call here would only create stray empty figures
  # note: distplot is deprecated in newer seaborn versions (histplot/displot are the replacements)
  sns.distplot(train_df[col],kde=False,rug=False,ax=ax1,color='black')
  sns.distplot(train_df[col],kde=True,rug=False,ax=ax2,color='red')
  sns.distplot(train_df[col],kde=True,rug=True,ax=ax3,color='green')
In [ ]:
for x in non_cat:
  histogram(x)

Note :

  1. Some of the features, such as Features 2, 17 and 19, are right skewed.

  2. Feature 24 has a left-skewed distribution.

  3. Features 1 and 23 appear to be roughly uniformly distributed.

  4. Some features (20, 22, 24) have bimodal distributions.
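
The skew read off the histograms can be cross-checked numerically with pandas' `Series.skew()`. A sketch on synthetic data (exponential and normal samples stand in for the actual features here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
right_skewed = pd.Series(rng.exponential(scale=2.0, size=5000))  # long right tail
symmetric = pd.Series(rng.normal(size=5000))                     # roughly symmetric

print(f"exponential sample skew: {right_skewed.skew():.2f}")  # clearly positive
print(f"normal sample skew:      {symmetric.skew():.2f}")     # near zero
```

A strongly positive skew value corresponds to the right-skewed shapes seen above, a negative one to the left-skewed Feature 24.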

In [ ]:
def BoxViolinPlots(col):
  """
  input : col - feature 
  plots : boxplots and violin plots to visualize outliers 
  """
  fig, ((ax1,ax2)) = plt.subplots(nrows=1,ncols=2,figsize=(12,5))  # a separate plt.figure call here would only create stray empty figures
  sns.boxplot(y=train_df[col],ax=ax1,color='orange')
  sns.violinplot(y=train_df[col],ax=ax2,color='red')
  ax1.set_xlabel(col)
  ax1.set_ylabel("Observed Values")
  ax2.set_xlabel(col)
  ax2.set_ylabel("Observed Values")
In [ ]:
for x in cols_train:
  BoxViolinPlots(x)

Note :

  1. Box plots and violin plots are both used to visualize the outliers.

  2. Features 4, 7, 9, 12, 17 and 20 have many outlier points; their impact on the models should be checked during model building.
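
The outliers eyeballed from the box plots can be counted precisely with the standard 1.5×IQR whisker rule. A minimal sketch (`iqr_outlier_count` is a hypothetical helper, and 1.5 is the usual box-plot convention rather than a value tuned for this dataset):

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series, k: float = 1.5) -> int:
    """Count points outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot whisker rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - k * iqr) | (s > q3 + k * iqr)).sum())

# toy data: 101 well-behaved points plus one extreme value
s = pd.Series(list(range(100)) + [10_000])
print(iqr_outlier_count(s))  # the single extreme point is flagged
```

Applying such a count per feature would make the "many outlier points" observation above quantitative.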

DATA PREPROCESSING :

Remove Highly Correlated Features:

In [ ]:
# removing highly correlated data 
def return_uncorrelated_dataset(df,test,threshold):
  """
  input : df  - train data 
          test- test data 
          threshold - maximum correlation between two features. If the correlation between two features is more than 
                      threshold, remove one feature
  return : train and test data with correlated features removed
  """
  correlation_matrix=df.corr()
  cols=df.columns.tolist()
  l=len(cols)
  threshold_limit_columns=[True]*l
  threshold_limit_columns[l-1]=False   # always drop the last column (the target variable, re-attached later in getData)
  for i in range(l):
    for j in range(i+1,l):
      if correlation_matrix.iloc[i,j]>threshold:
        if threshold_limit_columns[j]!=False:
          threshold_limit_columns[j]=False
  uncorrelated_features=df.columns[threshold_limit_columns]
  return df[uncorrelated_features],test[uncorrelated_features]

Missing Values Handling and Imputation :

In [ ]:
# identifying the missing values 
def Missing_Value_Checker(df,threshold=0):
  """
  input : df - train data 
          threshold - default as 0 --- to display all features with missing values 
                      can be any value - to display all features with missing values above the threshold
  prints : number of features with missing values greater than the threshold 
           features along with their missing values 
  """
  print("\n Check for NaN values in each feature :\n")
  missing_val_percent=round(df.isnull().sum()/len(df)*100,2)
  missing_val_percent=missing_val_percent.sort_values(ascending=False).where(missing_val_percent>threshold)
  print(missing_val_percent[~missing_val_percent.isnull()])
  print("\n Number of features with missing values :",missing_val_percent[~missing_val_percent.isnull()].count())

#drop the missing value column above the threshold mentioned 
def drop_missing_values(df,threshold):
  """
  input : df- train data 
          threshold - percentage of missing values 
  returns : df with features have the missing values less than threshold
  """
  missing_val_percent=round(df.isnull().sum()/len(df)*100,2)
  missing_val_percent=missing_val_percent.sort_values(ascending=False).where(missing_val_percent>threshold)
  remove_cols=list(missing_val_percent[~missing_val_percent.isnull()].index)
  df_new=df.drop(remove_cols,axis=1)
  return df_new

# imputation helper (the median is generally preferred here, since it is not affected by outliers)
def imputing_missingvalues(df,test,missing_col,imputed_val):
  """
  input : df- train 
          test - test data
          missing_col - column which has missing value to be imputed 
          imputed_val - value to be imputed inplace of NA in given missing_col
  returns : df,test with missing value imputed for the column passed with given imputed_val
  """
  df[missing_col].fillna(imputed_val,inplace=True)
  test[missing_col].fillna(imputed_val,inplace=True)
  return df,test
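
The preference for median over mean imputation mentioned in the comment above can be seen on a toy column with one extreme outlier (the values are purely illustrative):

```python
import numpy as np
import pandas as pd

# toy column with one extreme outlier and one missing entry
col = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan, 1000.0])

mean_imputed = col.fillna(col.mean())      # fill value 202.0, dragged up by the outlier
median_imputed = col.fillna(col.median())  # fill value 3.0, unaffected by the outlier

print(f"mean fill value:   {col.mean():.1f}")
print(f"median fill value: {col.median():.1f}")
```

Since several features here have heavy tails (see the box plots above), the two strategies can produce very different imputed datasets.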

Handling Data Duplications:

In [ ]:
def handling_duplicate_entries(df):
  """
  input : df-train 
  returns : df with duplicates of rows and columns will be removed
  """
  # to check if any row in the train data duplicated ------ DUPLICATE ROWS 
  print("\nTrain data before duplicates removed : ",df.shape)
  df=df.drop_duplicates()
  # DUPLICATE COLUMNS 
  df=df.T.drop_duplicates().T
  print("Train data after duplicates removed : ",df.shape)
  return df 

Removing Outlier Points :

In [ ]:
def remove_outlier(df):
  """
  input : df-train 
  returns : df keeping only those target classes that occur more than 3 times (rare classes treated as outliers)
  """
  df_new=df.groupby("Target Variable (Discrete)").filter(lambda x: len(x) >3)
  return df_new
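
The `groupby(...).filter(...)` idiom used in `remove_outlier` can be illustrated on a toy frame where one class is too rare to keep:

```python
import pandas as pd

# toy frame: class 'a' has 5 rows, class 'b' only 2
df = pd.DataFrame({
    "Target Variable (Discrete)": ["a"] * 5 + ["b"] * 2,
    "x": range(7),
})

# keep only classes with more than 3 training examples, as remove_outlier does
filtered = df.groupby("Target Variable (Discrete)").filter(lambda g: len(g) > 3)
print(filtered["Target Variable (Discrete)"].unique())  # class 'b' is dropped
```

`filter` evaluates the lambda once per group and keeps or drops each group's rows wholesale, which is exactly the rare-class pruning needed here.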

Train Validation Split :

In [ ]:
def TrainTestData(X,y):
  """
  input : X  - input features 
          y  - output target variable 
  returns : X_train,X_val,y_train,y_val
  """
  X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.3,random_state=42)
  return X_train,X_val,y_train,y_val
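
Given the class imbalance noted earlier, one possible refinement (an alternative to the plain split above, not what this notebook does) is to pass `stratify=y` so that rare classes keep the same proportions in train and validation:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy imbalanced data: 80 samples of class 0, 20 of class 1
X = pd.DataFrame({"f": range(100)})
y = pd.Series([0] * 80 + [1] * 20)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y  # preserve the 80/20 class ratio
)
print(y_val.value_counts().to_dict())
```

Without `stratify`, a random 30% split can under- or over-sample the rare classes, which makes validation accuracy noisier on imbalanced targets like this one.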

Data Normalization :

In [ ]:
def NormalizeData(train,val,test,normalizer):
  """
  input : train , val ,test ----> train , validation and test data 
          normalizer : 1    MinMaxScaler
                      -1    No normalization 
                       else StandardScaler
  returns : train,validation and test data normalized versions
  """
  if normalizer==1:
    normalizer=MinMaxScaler()
  elif normalizer==-1:
    return train,val,test
  else:
    normalizer=StandardScaler()
  train=normalizer.fit_transform(train)
  val=normalizer.transform(val)
  test=normalizer.transform(test)
  return train,val,test 

Download the Dataset :

Download the data to be fed to the model if the modelling is done in a separate Python notebook; otherwise this download section can be omitted.

In [ ]:
def DownloadData(data,filename):
  """
  input : data - data to be downloaded
         filename- filename to be downloaded in local storage 
  Downloads : data in local storage
  """
  np.save(filename,data)
  from google.colab import files
  files.download(filename)

def DownloadAll(train,trainname,val,valname,test,testname,ytrain,ytrainname,yval,yvalname):
  """
  input : train,val,test,ytrain,yval - data to be downloaded
         trainname,valname,testname,ytrainname,yvalname- filename to be downloaded in local storage 
  Downloads : data in local storage
  """
  DownloadData(train,trainname)
  DownloadData(val,valname)
  DownloadData(test,testname)
  DownloadData(ytrain,ytrainname)
  DownloadData(yval,yvalname)
  print("DATASETS DOWNLOADED")

CREATING DATASET :

In [ ]:
def getData(train,test,correlation_threshold,missing_value_threshold,imputation_method,normalize,outlier=1):
  """
  input    :     train --- train data 
                 test ---- test data 
                 correlation_threshold - value above which correlation between two features is removed
                 missing_value_threshold -feature having missing value above threshold will be removed
                 imputation_method : mean or median or any imputation technique 
                 normalize : 1 - MinMax 
                             else -StandardScaler 
                 Outlier : default 1 -remove outlier 
                            else - dont remove outlier
  returns   :    X_train,X_val,X_test,y_train,y_val
  """
  output_data=train["Target Variable (Discrete)"]
  train,test=return_uncorrelated_dataset(train,test,correlation_threshold)
  train["Target Variable (Discrete)"]=output_data
  Missing_Value_Checker(train)
  train=drop_missing_values(train,missing_value_threshold)
  test=test[train.drop("Target Variable (Discrete)",axis=1).columns]
  missing_cols=train.columns[train.isnull().any()]
  for cols in missing_cols:
    if imputation_method=="mean":
      imputed_val=train[cols].mean()
    elif imputation_method=="median":
      imputed_val=train[cols].median()
    train,test=imputing_missingvalues(train,test,cols,imputed_val)
  train=handling_duplicate_entries(train)
  if outlier==1:
    train=remove_outlier(train)
  y=train["Target Variable (Discrete)"]
  X=train.drop("Target Variable (Discrete)",axis=1)
  X_train,X_val,y_train,y_val=TrainTestData(X,y)
  X_train,X_val,X_test=NormalizeData(X_train,X_val,test,normalize)
  return X_train,X_val,X_test,y_train,y_val
In [ ]:
"""
Strategy of Preprocessing Used:
             remove features which are 95% correlated 
             remove feature which has missing values more than 50%
             impute missing values using mean 
             remove outlier in target variable
             MinMaxScaler
"""             
X_train,X_val,X_test,y_train,y_val = getData(train_df,test_df,0.95,50,"mean",1)
 Check for NaN values in each feature :

Feature 17    54.93
Feature 18    33.20
Feature 15     7.24
Feature 9      1.41
Feature 11     0.10
Feature 13     0.10
Feature 12     0.10
Feature 24     0.10
Feature 10     0.10
dtype: float64

 Number of features with missing values : 9

Train data before duplicates removed :  (994, 21)
Train data after duplicates removed :  (994, 21)

MODELLING :

Machine Learning Models -- Hyperparameter Tuning :

In [ ]:
def BestParams_GridSearchCV(algo,X,y,hyperparams,folds):
  """
  input   : algo : classification algorithm 
             X : X_train 
             y : y_train 
            param_grid: hyperparameters for the ML algorithm passed
  prints  : best hyperparameters 
  """
  # Instantiate the grid search model
  grid_search = GridSearchCV(estimator = algo, param_grid =hyperparams, 
                          cv =folds,scoring="accuracy",n_jobs = -1,verbose = 1,return_train_score=True)
  # Fit the grid search to the data
  grid_search.fit(X, y)
  results=pd.DataFrame(grid_search.cv_results_)
  return results 

def HyperParamsResultsPlot(results,modelname):
  """
  input : results_acc : grid search results got using accuracy as performance measure 
          modelname   : Machine Learning Model used 
  prints : best hyperparameters 
  plots  : hyperparameters plots for given model 
  """
  print("---------------------------------------------------------------------------------------------------------")
  print(modelname)
  print("---------------------------------------------------------------------------------------------------------")
  params=results.sort_values("mean_test_score",ascending=True)
  params['ID']=[i for i in range(1,len(params['mean_test_score'])+1)]
  fig, ((ax1)) = plt.subplots(nrows=1,ncols=1,figsize=(20,5))  # a separate plt.figure call here would only create stray empty figures
  ax1.plot(params['ID'],params['mean_test_score'],color='green')
  ax1.plot(params['ID'],params['mean_train_score'],color='red')
  ax1.set_title("Accuracy score using Hyperparameters of "+modelname)
  ax1.set_xlabel("Hyperparameter ID")
  ax1.set_ylabel("Accuracy score")
  best_acc_params=params['params'].iloc[-1]
  print("Best hyperparameters using accuracy as performance measure : \n",best_acc_params)
  return best_acc_params

def main_results(best_m,train,val,test,ytrain,yval):
  """
  inputs : best_m - best classifier (may be best random forest model,xgboost model etc)
           train,val,test,ytrain,yval- data after preprocessing 
  prints : validation accuracy 
  returns: test predictions 
  """
  best_m.fit(train,ytrain)
  y_pred = pd.Series(best_m.predict(val))
  print("\nValidation Accuracy : ",accuracy_score(yval,y_pred))
  test_pred=pd.Series(best_m.predict(test))
  return test_pred 

def submission(ytest,download=0,main_submission=0):
  """
  inputs : ytest - ytest predicted data 
           download - default 0 - don't download the submission file (.csv) for kaggle (just for visualization purpose)
                      else - download the submission file (.csv) for kaggle 
           main_submission -  default 0 - we don't consider that submission as final submission
                              otherwise - take as main submission file and return the submission file 
  returns : predicted target variable count of each classes
  """
  submission_file=pd.DataFrame()
  submission_file['Id']=range(1,len(ytest)+1)   # length taken from ytest itself rather than the global X_test
  submission_file['Category']=ytest 
  submission_file['Category']=submission_file['Category'].astype(int)
  submission_file.to_csv("test_output.csv",index=False)   # test_output.csv is stored in Current Working Directory
  if download!=0:
    files.download('test_output.csv')
  if main_submission!=0:
    return submission_file
  print("\n Test Predicted Category counts : \n")
  return submission_file['Category'].value_counts()
RANDOM FOREST MODEL:
In [ ]:
folds = KFold(n_splits = 5, shuffle = True, random_state = random_state)
In [ ]:
rf=RandomForestClassifier(random_state=random_state)
hyper_params_rf={
    'max_depth': [8,10,12],
    'min_samples_leaf':[1,2,3,4],
    'min_samples_split': [5,10,20],
    'n_estimators': [200,300], 
    'max_features': [15,20],
     'class_weight':['balanced']
}
In [ ]:
best_acc_params_rf=HyperParamsResultsPlot(BestParams_GridSearchCV(rf,X_train,y_train,hyper_params_rf,folds),"RANDOM FOREST")
Fitting 5 folds for each of 144 candidates, totalling 720 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   33.6s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed:  5.6min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:  9.1min finished
---------------------------------------------------------------------------------------------------------
RANDOM FOREST
---------------------------------------------------------------------------------------------------------
Best hyperparameters using accuracy as performance measure : 
 {'class_weight': 'balanced', 'max_depth': 10, 'max_features': 15, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 300}
XGBoost :
In [ ]:
xg=XGBClassifier(random_state=random_state)
hyper_params_xg={
    'max_depth': [3,8,10],
    'min_child_weight':[1,3,5],
    # note: 'min_samples_split', 'max_features' and 'class_weight' are scikit-learn
    # tree parameters, not XGBClassifier ones; XGBoost accepts and ignores unknown
    # keyword arguments, so these entries only enlarge the grid without effect
    'min_samples_split': [5,10],
    'n_estimators': [200,300], 
    'max_features': [10,15,20],
     'class_weight':['balanced']
}
In [ ]:
best_acc_params_xg=HyperParamsResultsPlot(BestParams_GridSearchCV(xg,X_train,y_train,hyper_params_xg,folds),"XGBoost")
Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   30.5s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed:  6.2min finished
---------------------------------------------------------------------------------------------------------
XGBoost
---------------------------------------------------------------------------------------------------------
Best hyperparameters using accuracy as performance measure : 
 {'class_weight': 'balanced', 'max_depth': 8, 'max_features': 20, 'min_child_weight': 1, 'min_samples_split': 10, 'n_estimators': 300}
GBDT :
In [ ]:
gbdt=GradientBoostingClassifier(random_state=random_state)
hyper_params_gbdt={
    'max_depth': [3,8,10],
    'min_samples_leaf':[1,3,5],
    'min_samples_split': [5,10],
    'n_estimators': [200,300], 
    'max_features': [10,15,20]
    }
In [ ]:
best_acc_params_gbdt=HyperParamsResultsPlot(BestParams_GridSearchCV(gbdt,X_train,y_train,hyper_params_gbdt,folds),"GBDT")
Fitting 5 folds for each of 108 candidates, totalling 540 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:  7.3min
[Parallel(n_jobs=-1)]: Done 446 tasks      | elapsed: 15.7min
[Parallel(n_jobs=-1)]: Done 540 out of 540 | elapsed: 19.3min finished
---------------------------------------------------------------------------------------------------------
GBDT
---------------------------------------------------------------------------------------------------------
Best hyparameters using accuracy as performance measure : 
 {'max_depth': 3, 'max_features': 10, 'min_samples_leaf': 3, 'min_samples_split': 10, 'n_estimators': 300}
SVM :
In [ ]:
svm=SVC(kernel='rbf',random_state=random_state)
hyper_params_svm={
    'C':[0.01,0.1,0.5,1,10,100,1000],
    'gamma':['auto','scale'],
    'class_weight':['balanced']
}
In [ ]:
best_acc_params_svm=HyperParamsResultsPlot(BestParams_GridSearchCV(svm,X_train,y_train,hyper_params_svm,folds),"SVM")
Fitting 5 folds for each of 14 candidates, totalling 70 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
---------------------------------------------------------------------------------------------------------
SVM
---------------------------------------------------------------------------------------------------------
Best hyperparameters using accuracy as performance measure : 
 {'C': 1000, 'class_weight': 'balanced', 'gamma': 'scale'}
[Parallel(n_jobs=-1)]: Done  70 out of  70 | elapsed:    3.0s finished
MULTILAYER PERCEPTRON :
In [ ]:
mlp=MLPClassifier()
hyper_params_mlp={
    'hidden_layer_sizes':[(100,),(200,),(300,)],
    'activation':['tanh','logistic','relu'],
    'alpha':[0.0001,0.001,0.01]
}
In [ ]:
best_acc_params_mlp=HyperParamsResultsPlot(BestParams_GridSearchCV(mlp,X_train,y_train,hyper_params_mlp,folds),"Multilayer Perceptron")
Fitting 5 folds for each of 27 candidates, totalling 135 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   42.7s
[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed:  1.8min finished
---------------------------------------------------------------------------------------------------------
Multilayer Perceptron
---------------------------------------------------------------------------------------------------------
Best hyperparameters using accuracy as performance measure : 
 {'activation': 'relu', 'alpha': 0.01, 'hidden_layer_sizes': (300,)}

Best Models :

Best Random Forest Model :
In [ ]:
# best rf model 
best_rf=RandomForestClassifier(random_state=random_state,max_depth=12,max_features=15,min_samples_leaf=1,min_samples_split=5,n_estimators=200)
rf_test_pred=main_results(best_rf,X_train,X_val,X_test,y_train,y_val)
submission(rf_test_pred)
Validation Accuracy :  0.9319727891156463

 Test Predicted Category counts : 

Out[ ]:
1.0     231
0.0     104
2.0      46
6.0      26
5.0      17
14.0      2
Name: Category, dtype: int64
Best XGBoost Model :
In [ ]:
# best xgboost model 
# note: class_weight, max_features and min_samples_split are scikit-learn tree parameters;
# XGBClassifier does not use them, so only max_depth, min_child_weight and n_estimators take effect
best_xg=XGBClassifier(class_weight='balanced',max_depth=8,max_features=10,min_child_weight=1,min_samples_split=5,n_estimators=300,random_state=random_state)
xg_test_pred=main_results(best_xg,X_train,X_val,X_test,y_train,y_val)
submission(xg_test_pred)
Validation Accuracy :  0.9319727891156463

 Test Predicted Category counts : 

Out[ ]:
1.0     226
0.0     106
2.0      47
6.0      27
5.0      16
14.0      2
15.0      1
8.0       1
Name: Category, dtype: int64
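Since the XGBClassifier call above mixes in scikit-learn-only names (`class_weight`, `max_features`, `min_samples_split`), a small guard against silently-ignored hyperparameters can help. `get_params()` exposes what an estimator actually accepts; `unknown_params` below is a hypothetical helper, illustrated here with a scikit-learn model:

```python
# check a param dict against what an estimator actually recognises
from sklearn.ensemble import GradientBoostingClassifier

def unknown_params(estimator, param_dict):
    """Return the names in param_dict that the estimator does not recognise."""
    return sorted(set(param_dict) - set(estimator.get_params()))

# min_child_weight is an XGBoost name, unknown to sklearn's GBDT
print(unknown_params(GradientBoostingClassifier(),
                     {'max_depth': 3, 'min_child_weight': 1}))
# → ['min_child_weight']
```

Running such a check before each grid search would catch parameters that a model silently drops instead of tuning.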
Best GBDT Model :
In [ ]:
# best gbdt model 
best_gbdt=GradientBoostingClassifier(max_depth=10,max_features=15,min_samples_leaf=1,min_samples_split=10,n_estimators=200,random_state=random_state)
gbdt_test_pred=main_results(best_gbdt,X_train,X_val,X_test,y_train,y_val)
submission(gbdt_test_pred)
Validation Accuracy :  0.9251700680272109

 Test Predicted Category counts : 

Out[ ]:
1.0     226
0.0     104
2.0      43
6.0      27
5.0      17
8.0       4
14.0      3
15.0      1
7.0       1
Name: Category, dtype: int64
Best SVM Model :
In [ ]:
# best svm model 
best_svm=SVC(C=1000,class_weight='balanced',gamma='scale',probability=True,random_state=random_state)
svm_test_pred=main_results(best_svm,X_train,X_val,X_test,y_train,y_val)
submission(svm_test_pred)
Validation Accuracy :  0.8877551020408163

 Test Predicted Category counts : 

Out[ ]:
1.0     232
0.0      99
2.0      37
6.0      29
5.0      17
15.0      5
8.0       4
14.0      3
Name: Category, dtype: int64
Best MLP Model :
In [ ]:
# best mlp model 
best_mlp=MLPClassifier(activation='relu',alpha=0.01,hidden_layer_sizes=(300,),random_state=random_state)
mlp_test_pred=main_results(best_mlp,X_train,X_val,X_test,y_train,y_val)
submission(mlp_test_pred)
Validation Accuracy :  0.8775510204081632

 Test Predicted Category counts : 

Out[ ]:
1.0     236
0.0     103
2.0      47
6.0      29
5.0      10
14.0      1
Name: Category, dtype: int64
Voting Classifier (Type - HARD) :
In [ ]:
# hard voting classifier 
vcl_hard=VotingClassifier(estimators=[('rf',best_rf),('xgb',best_xg),('gbdt',best_gbdt)],voting='hard')
vcl_hard_test_pred=main_results(vcl_hard,X_train,X_val,X_test,y_train,y_val)
submission(vcl_hard_test_pred)
Validation Accuracy :  0.935374149659864

 Test Predicted Category counts : 

Out[ ]:
1.0     227
0.0     105
2.0      46
6.0      27
5.0      17
14.0      2
15.0      1
8.0       1
Name: Category, dtype: int64
Voting Classifier (Type - SOFT) :
In [ ]:
# soft voting classifier 
vcl_soft=VotingClassifier(estimators=[('rf',best_rf),('xgb',best_xg),('gbdt',best_gbdt)],voting='soft',weights=[0.35,0.35,0.3])
vcl_soft_test_pred=main_results(vcl_soft,X_train,X_val,X_test,y_train,y_val)
submission(vcl_soft_test_pred)
Validation Accuracy :  0.935374149659864

 Test Predicted Category counts : 

Out[ ]:
1.0     227
0.0     105
2.0      46
6.0      27
5.0      16
14.0      3
15.0      1
8.0       1
Name: Category, dtype: int64
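The difference between the two ensembles above: hard voting takes a majority over the predicted labels, while soft voting averages the (optionally weighted) class probabilities and then takes the argmax. A lightweight sketch with stand-in estimators and synthetic data:

```python
# hard vs soft voting with stand-in estimators (not the tuned hackathon models)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=100)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=100)
ests = [('rf', RandomForestClassifier(n_estimators=50, random_state=100)),
        ('lr', LogisticRegression(max_iter=1000))]
# hard: majority vote over labels; soft: weighted average of predict_proba
hard = VotingClassifier(ests, voting='hard').fit(X_tr, y_tr)
soft = VotingClassifier(ests, voting='soft', weights=[0.6, 0.4]).fit(X_tr, y_tr)
print(hard.score(X_va, y_va), soft.score(X_va, y_va))
```

Soft voting requires every base estimator to implement `predict_proba`, which is why tree ensembles (and an SVC with `probability=True`) combine cleanly here.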

RESULTS :

Final Model :

In [ ]:
# soft voting classifier 
final_clf=VotingClassifier(estimators=[('rf',best_rf),('xgb',best_xg),('gbdt',best_gbdt)],voting='soft',weights=[0.35,0.35,0.3])
test_pred=main_results(final_clf,X_train,X_val,X_test,y_train,y_val)
Validation Accuracy :  0.935374149659864

Submission File Generation :

In [ ]:
# downloading the main submission file for kaggle
submission(test_pred,1)
 Test Predicted Category counts : 

Out[ ]:
1.0     227
0.0     105
2.0      46
6.0      27
5.0      16
14.0      3
15.0      1
8.0       1
Name: Category, dtype: int64
In [ ]:
# main kaggle submission file 
final_submission = submission(test_pred,0,1)
final_submission
Out[ ]:
Id Category
0 1 6.0
1 2 2.0
2 3 1.0
3 4 1.0
4 5 1.0
5 6 2.0
6 7 1.0
7 8 5.0
8 9 5.0
9 10 0.0
10 11 1.0
11 12 6.0
12 13 0.0
13 14 1.0
14 15 2.0
15 16 0.0
16 17 1.0
17 18 1.0
18 19 1.0
19 20 1.0
20 21 5.0
21 22 5.0
22 23 1.0
23 24 1.0
24 25 0.0
25 26 2.0
26 27 6.0
27 28 0.0
28 29 1.0
29 30 0.0
30 31 0.0
31 32 1.0
32 33 1.0
33 34 6.0
34 35 1.0
35 36 0.0
36 37 6.0
37 38 0.0
38 39 0.0
39 40 1.0
40 41 14.0
41 42 1.0
42 43 1.0
43 44 1.0
44 45 0.0
45 46 1.0
46 47 1.0
47 48 0.0
48 49 6.0
49 50 1.0
50 51 2.0
51 52 1.0
52 53 1.0
53 54 1.0
54 55 1.0
55 56 1.0
56 57 6.0
57 58 2.0
58 59 0.0
59 60 0.0
60 61 1.0
61 62 1.0
62 63 1.0
63 64 6.0
64 65 1.0
65 66 1.0
66 67 2.0
67 68 2.0
68 69 1.0
69 70 2.0
70 71 1.0
71 72 2.0
72 73 1.0
73 74 1.0
74 75 6.0
75 76 1.0
76 77 1.0
77 78 1.0
78 79 1.0
79 80 1.0
80 81 1.0
81 82 1.0
82 83 0.0
83 84 1.0
84 85 6.0
85 86 0.0
86 87 1.0
87 88 1.0
88 89 1.0
89 90 1.0
90 91 1.0
91 92 0.0
92 93 2.0
93 94 1.0
94 95 1.0
95 96 1.0
96 97 0.0
97 98 1.0
98 99 2.0
99 100 0.0
100 101 1.0
101 102 0.0
102 103 0.0
103 104 1.0
104 105 1.0
105 106 1.0
106 107 1.0
107 108 0.0
108 109 0.0
109 110 2.0
110 111 1.0
111 112 0.0
112 113 0.0
113 114 0.0
114 115 0.0
115 116 6.0
116 117 0.0
117 118 1.0
118 119 2.0
119 120 1.0
120 121 0.0
121 122 1.0
122 123 1.0
123 124 1.0
124 125 6.0
125 126 1.0
126 127 1.0
127 128 1.0
128 129 1.0
129 130 1.0
130 131 1.0
131 132 1.0
132 133 1.0
133 134 0.0
134 135 0.0
135 136 2.0
136 137 8.0
137 138 2.0
138 139 1.0
139 140 0.0
140 141 0.0
141 142 1.0
142 143 0.0
143 144 2.0
144 145 1.0
145 146 1.0
146 147 6.0
147 148 1.0
148 149 2.0
149 150 0.0
150 151 1.0
151 152 1.0
152 153 6.0
153 154 1.0
154 155 1.0
155 156 1.0
156 157 1.0
157 158 0.0
158 159 1.0
159 160 1.0
160 161 0.0
161 162 1.0
162 163 0.0
163 164 0.0
164 165 2.0
165 166 1.0
166 167 0.0
167 168 0.0
168 169 0.0
169 170 1.0
170 171 0.0
171 172 0.0
172 173 1.0
173 174 1.0
174 175 1.0
175 176 1.0
176 177 1.0
177 178 1.0
178 179 0.0
179 180 0.0
180 181 2.0
181 182 5.0
182 183 1.0
183 184 1.0
184 185 2.0
185 186 1.0
186 187 1.0
187 188 2.0
188 189 1.0
189 190 1.0
190 191 5.0
191 192 0.0
192 193 2.0
193 194 6.0
194 195 1.0
195 196 1.0
196 197 0.0
197 198 1.0
198 199 1.0
199 200 1.0
200 201 1.0
201 202 0.0
202 203 1.0
203 204 6.0
204 205 1.0
205 206 1.0
206 207 1.0
207 208 0.0
208 209 1.0
209 210 14.0
210 211 0.0
211 212 0.0
212 213 0.0
213 214 0.0
214 215 2.0
215 216 1.0
216 217 1.0
217 218 0.0
218 219 1.0
219 220 1.0
220 221 1.0
221 222 1.0
222 223 0.0
223 224 1.0
224 225 1.0
225 226 6.0
226 227 1.0
227 228 1.0
228 229 0.0
229 230 1.0
230 231 2.0
231 232 1.0
232 233 1.0
233 234 5.0
234 235 1.0
235 236 0.0
236 237 1.0
237 238 1.0
238 239 0.0
239 240 0.0
240 241 0.0
241 242 1.0
242 243 1.0
243 244 2.0
244 245 6.0
245 246 5.0
246 247 1.0
247 248 6.0
248 249 0.0
249 250 1.0
250 251 0.0
251 252 5.0
252 253 1.0
253 254 1.0
254 255 1.0
255 256 15.0
256 257 0.0
257 258 0.0
258 259 0.0
259 260 1.0
260 261 6.0
261 262 1.0
262 263 1.0
263 264 0.0
264 265 0.0
265 266 0.0
266 267 0.0
267 268 0.0
268 269 0.0
269 270 0.0
270 271 0.0
271 272 0.0
272 273 1.0
273 274 2.0
274 275 5.0
275 276 6.0
276 277 0.0
277 278 0.0
278 279 2.0
279 280 1.0
280 281 1.0
281 282 1.0
282 283 1.0
283 284 5.0
284 285 5.0
285 286 1.0
286 287 2.0
287 288 1.0
288 289 0.0
289 290 1.0
290 291 0.0
291 292 1.0
292 293 1.0
293 294 6.0
294 295 0.0
295 296 5.0
296 297 0.0
297 298 1.0
298 299 14.0
299 300 2.0
300 301 1.0
301 302 1.0
302 303 1.0
303 304 1.0
304 305 2.0
305 306 1.0
306 307 1.0
307 308 1.0
308 309 6.0
309 310 1.0
310 311 2.0
311 312 1.0
312 313 1.0
313 314 0.0
314 315 1.0
315 316 1.0
316 317 2.0
317 318 1.0
318 319 2.0
319 320 1.0
320 321 1.0
321 322 1.0
322 323 0.0
323 324 0.0
324 325 1.0
325 326 0.0
326 327 1.0
327 328 0.0
328 329 1.0
329 330 1.0
330 331 0.0
331 332 0.0
332 333 1.0
333 334 1.0
334 335 1.0
335 336 1.0
336 337 1.0
337 338 1.0
338 339 1.0
339 340 1.0
340 341 0.0
341 342 1.0
342 343 1.0
343 344 1.0
344 345 0.0
345 346 2.0
346 347 0.0
347 348 0.0
348 349 1.0
349 350 1.0
350 351 1.0
351 352 1.0
352 353 2.0
353 354 5.0
354 355 1.0
355 356 2.0
356 357 0.0
357 358 6.0
358 359 2.0
359 360 6.0
360 361 2.0
361 362 1.0
362 363 2.0
363 364 0.0
364 365 6.0
365 366 0.0
366 367 1.0
367 368 0.0
368 369 1.0
369 370 0.0
370 371 1.0
371 372 1.0
372 373 1.0
373 374 1.0
374 375 0.0
375 376 1.0
376 377 1.0
377 378 1.0
378 379 0.0
379 380 1.0
380 381 5.0
381 382 1.0
382 383 2.0
383 384 1.0
384 385 1.0
385 386 0.0
386 387 2.0
387 388 1.0
388 389 1.0
389 390 1.0
390 391 1.0
391 392 1.0
392 393 0.0
393 394 1.0
394 395 1.0
395 396 1.0
396 397 5.0
397 398 1.0
398 399 1.0
399 400 2.0
400 401 1.0
401 402 1.0
402 403 6.0
403 404 2.0
404 405 2.0
405 406 0.0
406 407 1.0
407 408 1.0
408 409 1.0
409 410 1.0
410 411 1.0
411 412 1.0
412 413 2.0
413 414 1.0
414 415 1.0
415 416 1.0
416 417 0.0
417 418 1.0
418 419 0.0
419 420 1.0
420 421 0.0
421 422 1.0
422 423 1.0
423 424 1.0
424 425 1.0
425 426 1.0

FoML Competition Test Dataset Accuracy :

KAGGLE ACCURACY SCORE : 0.94366
(screenshot: kaggle result 0.9435.PNG)

In [4]:
!jupyter nbconvert --to html /content/ASSIGNMENT_2_FML_FINAL_SUBMISSION_FILE.ipynb
[NbConvertApp] Converting notebook /content/ASSIGNMENT_2_FML_FINAL_SUBMISSION_FILE.ipynb to html
[NbConvertApp] Writing 4446072 bytes to /content/ASSIGNMENT_2_FML_FINAL_SUBMISSION_FILE.html